feat(bench): flip sandbox-e (schema compression) to ACTIVE — first new ACTIVE since vllm-q4-llama8b#51

Merged
OpenCircuitDev merged 1 commit into `main` from `feat/sandbox-e-schema-compression-active`
May 9, 2026
@OpenCircuitDev

Owner

Summary

First sandbox to flip from INACTIVE to ACTIVE since the framework shipped. Sandbox E (schema compression) measures the input-token reduction from OCM's canonical MCP-tool compression recipe, with no model invocation needed for the primary metric — pure deterministic measurement.

Local validation

Ran end-to-end on the actual workload via direct `python bench.py` (no Docker):

| Field | Value |
| --- | --- |
| primary_value | 70.12% median reduction |
| confirm threshold | ≥ 30% |
| verdict | CONFIRMED |
| reason | `primary 70.121 >= confirm_at_least 30.0` |
| tokenizer | cl100k_base (real tiktoken) |
| tokens median | 117.5 → 28.5 |
| n_tools | 30 |
| duration | 35 ms |
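
A note on reading the table: the median token counts alone (117.5 → 28.5) would imply a reduction of about 75.7%, not 70.12%. The two figures are compatible if `primary_value` is the median of per-tool percentage reductions rather than a reduction computed from the two medians. The sketch below illustrates the difference on made-up token counts; the function names are hypothetical and this is not the code from `bench.py`.

```python
from statistics import median


def median_pct_reduction(before, after):
    """Median of the per-tool percentage reductions."""
    pcts = [100.0 * (b - a) / b for b, a in zip(before, after)]
    return median(pcts)


def reduction_of_medians(before, after):
    """Percentage reduction computed from the two median token counts."""
    return 100.0 * (1 - median(after) / median(before))


# Made-up token counts for three tools (not taken from the real workload).
before = [100, 200, 400]
after = [50, 40, 200]

print(median_pct_reduction(before, after))   # per-tool reductions: 50%, 80%, 50%
print(reduction_of_medians(before, after))   # medians: 200 before, 50 after
```

On this toy data the first metric is 50.0 and the second is 75.0, so a gap between the two in the table above does not by itself indicate an error.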

Spec impact

Spec v0.2 row 21 claimed a 30-60% reduction. The measured 70% exceeds that upper bound, so the recipe is more aggressive than originally claimed. This is worth a follow-up note in spec hygiene, and secondary accuracy validation becomes proportionally more important once the model-dependent harness lands.

Frame

First new ACTIVE flip = ~700 lines of work (workload generator + 30-tool fixture + bench.py + compose + expected.json refit). Sets the recipe for the other 12 INACTIVE stubs as their `blocked_on` items resolve.

What this changes

  • 2 ACTIVE sandboxes total (vllm-q4-llama8b + sandbox-e)
  • 12 INACTIVE
  • Workload registry grows: `bench/workloads/mcp-tool-defs-30.jsonl`
  • Reusable generator script: `bench/workloads/_generate_mcp_tool_defs.py`

🤖 Generated with Claude Code

Resolves all 3 blocked_on items the original INACTIVE stub listed,
without needing the full MCP-multiturn-with-model harness:

  - workload curated: bench/workloads/mcp-tool-defs-30.jsonl (30
    representative MCP tool defs across 6 categories — filesystem,
    web, code, calendar, email, system)
  - bench.py: applies canonical schema compression (strip
    descriptions, shorten param names, hide optional params),
    counts tokens before/after via cl100k_base (with deterministic
    char-div-4 fallback), reports median pct reduction
  - docker-compose.yml: minimal python:3.11-slim container with
    tiktoken installed; reads workload from /workloads/, writes
    outputs.json
  - expected.json: status flipped ACTIVE; secondary metric (tool-call
    accuracy delta) explicitly removed and tracked as a future
    paired model-dependent sandbox
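
The three compression steps and the tokenizer fallback described above could look roughly like the following. This is a minimal sketch, not the contents of `bench.py`: it assumes each tool definition carries `name`, `description`, and a JSON Schema `inputSchema`, and the positional `p0`, `p1`, ... renaming is a hypothetical stand-in for whatever shortening rule the real recipe uses.

```python
import json

try:
    import tiktoken
    _ENC = tiktoken.get_encoding("cl100k_base")
except Exception:  # tiktoken missing, or the encoding file cannot be loaded
    _ENC = None


def count_tokens(text):
    """Count tokens with cl100k_base, else fall back to chars // 4."""
    if _ENC is not None:
        return len(_ENC.encode(text))
    return len(text) // 4


def compress_tool(tool):
    """Strip descriptions, hide optional params, shorten param names."""
    schema = tool.get("inputSchema", {})
    props = schema.get("properties", {})
    short_props = {}
    # Only required params survive; optional ones are hidden.
    for i, name in enumerate(schema.get("required", [])):
        spec = {k: v for k, v in props.get(name, {}).items() if k != "description"}
        short_props[f"p{i}"] = spec  # hypothetical shortening scheme
    return {
        "name": tool["name"],  # top-level description is dropped
        "inputSchema": {
            "type": "object",
            "properties": short_props,
            "required": list(short_props),
        },
    }


# Illustrative tool definition (not from the real workload file).
tool = {
    "name": "read_file",
    "description": "Read a file from disk and return its contents.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "file_path": {"type": "string", "description": "Absolute path"},
            "encoding": {"type": "string", "description": "Text encoding"},
        },
        "required": ["file_path"],
    },
}

compressed = compress_tool(tool)
before = count_tokens(json.dumps(tool))
after = count_tokens(json.dumps(compressed))
print(f"{before} -> {after} tokens")
```

Hiding optional params and renaming the rest changes what the model sees, which is why the removed accuracy-delta metric matters as a follow-up.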

Local end-to-end measurement (no Docker, direct python bench.py):
  primary_value: 70.12% median reduction (cl100k_base tokenizer)
  threshold:    confirm_at_least=30%
  verdict:      CONFIRMED — well above the 30% bar
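
The verdict appears to come from a simple threshold comparison, judging by the reason string `primary 70.121 >= confirm_at_least 30.0`. A minimal sketch of such a check follows; the function name and the `NOT_CONFIRMED` label for the failing case are assumptions, since the source only shows the passing case.

```python
def evaluate(primary_value, confirm_at_least):
    """Return (verdict, reason) for a 'higher is better' primary metric."""
    if primary_value >= confirm_at_least:
        reason = f"primary {primary_value:.3f} >= confirm_at_least {confirm_at_least}"
        return "CONFIRMED", reason
    # Hypothetical label: the failing verdict name is not given in this PR.
    reason = f"primary {primary_value:.3f} < confirm_at_least {confirm_at_least}"
    return "NOT_CONFIRMED", reason


verdict, reason = evaluate(70.121, 30.0)
print(verdict, "|", reason)
```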

Also locked in this PR:
  - .gitignore: bench/isolation/**/outputs.json (per-run artifact,
    not source of truth — bench/results/ holds the canonical summaries)
  - generator script for the workload (deterministic — re-run produces
    identical output)
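
One common way to get a generator whose re-runs produce identical output is a locally seeded RNG plus stable JSON serialization. The sketch below assumes that approach; it is not `_generate_mcp_tool_defs.py`, and the tool names, parameter names, and seed are hypothetical. Only the six categories and the count of 30 come from this PR.

```python
import json
import random

CATEGORIES = ["filesystem", "web", "code", "calendar", "email", "system"]


def generate_tool_defs(n_tools=30, seed=0):
    """Return JSONL text with n_tools synthetic tool definitions."""
    rng = random.Random(seed)  # local seeded RNG, independent of global state
    lines = []
    for i in range(n_tools):
        category = CATEGORIES[i % len(CATEGORIES)]
        n_params = rng.randint(1, 4)
        params = [f"param_{j}" for j in range(n_params)]
        n_required = rng.randint(1, n_params)
        tool = {
            "name": f"{category}_tool_{i}",
            "category": category,
            "description": f"Synthetic {category} tool number {i}.",
            "inputSchema": {
                "type": "object",
                "properties": {
                    p: {"type": "string", "description": f"Value for {p}"}
                    for p in params
                },
                "required": params[:n_required],
            },
        }
        # sort_keys keeps the serialized bytes stable across runs.
        lines.append(json.dumps(tool, sort_keys=True))
    return "\n".join(lines) + "\n"


jsonl = generate_tool_defs()
print(len(jsonl.splitlines()), "tool definitions")
```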

Net effect: bench framework now has 2 ACTIVE sandboxes (vllm-q4-llama8b
+ sandbox-e), 12 INACTIVE. dry-run-all reports cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@OpenCircuitDev merged commit 8af15ab into main on May 9, 2026
1 check passed
@OpenCircuitDev deleted the feat/sandbox-e-schema-compression-active branch on May 9, 2026 at 23:02